Write a short description about the course and add a link to your github repository here. This is an R markdown (.Rmd) file so you can use R markdown syntax. See the ‘Useful links’ page in the mooc area (chapter 1) for instructions.
Spring 2017
We meet each week on Thursday from 8am to 10am in Unioninkatu35 to learn about Data Science
During second week most interestingly I have been learning single and multiple regression analysis and model fitting
learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = T)
dim( learning2014 )
## [1] 166 7
head( learning2014 )
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
str( learning2014 )
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
This data comes from an international survey from a class of students enrolled to Introduction to Social Statistics (fall 2014). The data was collected between 3.12.2014 - 10.01.2015 and created on 14.01.2015. The sample size is 183 with 63 variables, however we selected variables of our interest and filtered the data for points > 0. After cleaning the data, we end up with 166 subjects and 7 variables to analyse.
library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
plot_data <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
plot_data
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
1) Total number of males are 50% less than females. 2) Attitute is much higher in males. 3) Deep and surface questions have negative correlations in males whereas in females it’s almost not correlating 4) Based on the summary analysis deep, surface, strategic questions and points are normally distributed as they have simillar mean and median values.
regression_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(regression_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In this multivariate model, we are explaining the variable points against attitude, stra and surf i.e. the explainatory variables. Based on the regression model, points have significant relationship with attitude. stra and surf show no significant relationship with points R-squared value of 0.20 implies that the model can explain 20% or one-fifth of the variation in the outcome.
new_regression_model <- lm(points ~ attitude, data = learning2014)
summary(new_regression_model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
This univariate model shoes that points is significantly realted to attitude Multiple R-squared: 0.1151 R-squared = Explained variation / Total variation R-squared is always between 0 and 100%: 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. Multiple R-squared is used for evaluating how well the model fits the data. In this case, R-squared value of 0.11 implies that the model can explain only 11% of the variation in the outcome.
plot(new_regression_model, which = c(1, 2, 5), par(mfrow = c(2,2)) )
Assumptions of the model 1. How well the model descrices the variables we are interested in 2. Linearity: The target variable is modelled as a linear combination of the model parameters 3. Errors are normally disrtibuted, not correlated and have constant variance
Residual vs Fitted plot explains about variance in errors. We could see that some errors deviate from the regression line implying that there is issue with the model QQplot of our model shows that most points fall close to the line but some points are not close on the left hand side of the graph, hence the fit is somewhere near to the normality assumption. The model is reasonably okay. Leverage plot shows the impact of a single observation on the model. There are some observations (values of -3) that have a high impact on the model which is not good.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(boot)
alc = read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep = ",", header = T)
dim(alc)
## [1] 382 35
head(alc)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob
## 1 GP F 18 U GT3 A 4 4 at_home teacher
## 2 GP F 17 U GT3 T 1 1 at_home other
## 3 GP F 15 U LE3 T 1 1 at_home other
## 4 GP F 15 U GT3 T 4 2 health services
## 5 GP F 16 U GT3 T 3 3 other other
## 6 GP M 16 U LE3 T 4 3 services other
## reason nursery internet guardian traveltime studytime failures
## 1 course yes no mother 2 2 0
## 2 course no yes father 1 2 0
## 3 other yes yes mother 1 2 3
## 4 home yes yes mother 1 3 0
## 5 home yes no father 1 2 0
## 6 reputation yes yes mother 1 2 0
## schoolsup famsup paid activities higher romantic famrel freetime goout
## 1 yes no no no yes no 4 3 4
## 2 no yes no no yes no 5 3 3
## 3 yes no yes no yes no 4 3 2
## 4 no yes yes yes yes yes 3 2 2
## 5 no yes yes no yes no 4 3 2
## 6 no yes yes yes yes no 5 4 2
## Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1 1 1 3 6 5 6 6 1.0 FALSE
## 2 1 1 3 4 5 5 6 1.0 FALSE
## 3 2 3 3 10 7 8 10 2.5 TRUE
## 4 1 1 5 2 15 14 15 1.0 FALSE
## 5 1 2 5 4 6 10 10 1.5 FALSE
## 6 1 2 5 10 15 15 15 1.5 FALSE
str(alc)
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This data comes from student alcohol consumption dataset, Porto (Portugal).
The original sample size is 1044 with 32 variables, however we selected variables of our interest and filtered the data high alcohol usage > 2.
After cleaning the data, we end up with 382 subjects and 35 variables to analyse.
The 4 variables that I have chosen for the analysis are:
1. failures: When someone fails, the chances of drug/alcohol consumption is high
2. health: I don’t think that the state of health has any effect on alcohol consumption, however there might be a chance that unhealthy people tend to consume more alcohol
3. age: In general, adults of all ages drink, so I don’t think it accounts fot high alcohol consumption
4. freetime: This factor could play a significant role in alcohol consumption. If one is bored/have nothing to do, they may tend to consume more alcohol.
Numerically and graphically explore the distributions of chosen variables and their relationships with alcohol consumption
Box plots are made with and without sex differences
Comments and comparision of findings and result at the end of this section
alc %>% group_by(high_use, sex, failures) %>% summarise(count =n())
## Source: local data frame [16 x 4]
## Groups: high_use, sex [?]
##
## high_use sex failures count
## <lgl> <fctr> <int> <int>
## 1 FALSE F 0 136
## 2 FALSE F 1 14
## 3 FALSE F 2 3
## 4 FALSE F 3 4
## 5 FALSE M 0 98
## 6 FALSE M 1 8
## 7 FALSE M 2 2
## 8 FALSE M 3 5
## 9 TRUE F 0 32
## 10 TRUE F 1 2
## 11 TRUE F 2 5
## 12 TRUE F 3 2
## 13 TRUE M 0 50
## 14 TRUE M 1 14
## 15 TRUE M 2 1
## 16 TRUE M 3 6
failure = alc$failures
d=density(failure)
plot(d)
polygon(d, col="red", border="blue")
ga <- ggplot(alc, aes(x = high_use, y = failures))
ga + geom_boxplot()
g1 <- ggplot(alc, aes(x = high_use, y = failures, col =sex))
g1 + geom_boxplot()
alc %>% group_by(high_use, sex, health) %>% summarise(count =n())
## Source: local data frame [20 x 4]
## Groups: high_use, sex [?]
##
## high_use sex health count
## <lgl> <fctr> <int> <int>
## 1 FALSE F 1 24
## 2 FALSE F 2 15
## 3 FALSE F 3 39
## 4 FALSE F 4 34
## 5 FALSE F 5 45
## 6 FALSE M 1 11
## 7 FALSE M 2 14
## 8 FALSE M 3 24
## 9 FALSE M 4 13
## 10 FALSE M 5 51
## 11 TRUE F 1 6
## 12 TRUE F 2 9
## 13 TRUE F 3 5
## 14 TRUE F 4 5
## 15 TRUE F 5 16
## 16 TRUE M 1 5
## 17 TRUE M 2 5
## 18 TRUE M 3 15
## 19 TRUE M 4 12
## 20 TRUE M 5 34
health = alc$health
d=density(health)
plot(d)
polygon(d, col="red", border="blue")
gb <- ggplot(alc, aes(x = high_use, y = health))
gb + geom_boxplot()
g2 <- ggplot(alc, aes(x = high_use, y = health, col =sex))
g2 + geom_boxplot()
alc %>% group_by(high_use, sex, age) %>% summarise(count =n())
## Source: local data frame [22 x 4]
## Groups: high_use, sex [?]
##
## high_use sex age count
## <lgl> <fctr> <int> <int>
## 1 FALSE F 15 29
## 2 FALSE F 16 42
## 3 FALSE F 17 46
## 4 FALSE F 18 37
## 5 FALSE F 19 3
## 6 FALSE M 15 34
## 7 FALSE M 16 37
## 8 FALSE M 17 19
## 9 FALSE M 18 18
## 10 FALSE M 19 4
## # ... with 12 more rows
age = alc$age
d=density(age)
plot(d)
polygon(d, col="red", border="blue")
gc <- ggplot(alc, aes(x = high_use, y = age))
gc + geom_boxplot()
g3 <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g3 + geom_boxplot()
alc %>% group_by(high_use, sex, freetime) %>% summarise(count =n())
## Source: local data frame [19 x 4]
## Groups: high_use, sex [?]
##
## high_use sex freetime count
## <lgl> <fctr> <int> <int>
## 1 FALSE F 1 14
## 2 FALSE F 2 30
## 3 FALSE F 3 72
## 4 FALSE F 4 35
## 5 FALSE F 5 6
## 6 FALSE M 1 2
## 7 FALSE M 2 17
## 8 FALSE M 3 44
## 9 FALSE M 4 34
## 10 FALSE M 5 16
## 11 TRUE F 2 5
## 12 TRUE F 3 19
## 13 TRUE F 4 15
## 14 TRUE F 5 2
## 15 TRUE M 1 2
## 16 TRUE M 2 10
## 17 TRUE M 3 21
## 18 TRUE M 4 25
## 19 TRUE M 5 13
freetime = alc$freetime
d=density(freetime)
plot(d)
polygon(d, col="red", border="blue")
gd <- ggplot(alc, aes(x = high_use, y = freetime))
gd + geom_boxplot()
g4 <- ggplot(alc, aes(x = high_use, y = freetime, col =sex))
g4 + geom_boxplot()
From the density plots, we can clearly see that only freetime is normally distribued. Other variables are skewed and don’t follow normal distribution.
From box plot, it seems that failures has an effect on alcohol consumption. Whereas health age an freetime don’t have a significant relationship with alcohol consumption.
Comparing it to the previously stated hypothesis, everything is in concordance except freetime as I initially thought that freetime affects alcohol consumption.
m <- glm(high_use ~ failures + health + age + freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + health + age + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6185 -0.8271 -0.7121 1.2833 2.0237
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.18073 1.76260 -2.939 0.00329 **
## failures 0.33431 0.15015 2.226 0.02598 *
## health 0.08933 0.08489 1.052 0.29265
## age 0.16966 0.09955 1.704 0.08835 .
## freetime 0.31847 0.12057 2.641 0.00826 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 442.58 on 377 degrees of freedom
## AIC: 452.58
##
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept) failures health age freetime
## -5.18073413 0.33431490 0.08932739 0.16965556 0.31847032
OR <- coef(m) %>% exp
OR
## (Intercept) failures health age freetime
## 0.005623876 1.396982988 1.093438576 1.184896660 1.375022805
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
CI
## 2.5 % 97.5 %
## (Intercept) 0.0001664349 0.1699432
## failures 1.0399367080 1.8815762
## health 0.9274606199 1.2946599
## age 0.9760886814 1.4434958
## freetime 1.0888815402 1.7486579
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.005623876 0.0001664349 0.1699432
## failures 1.396982988 1.0399367080 1.8815762
## health 1.093438576 0.9274606199 1.2946599
## age 1.184896660 0.9760886814 1.4434958
## freetime 1.375022805 1.0888815402 1.7486579
In this logistic regression model, we are explaining the variable points i.e. failure, health, age and freetime against high alcohol use. Based on the regression model, failure and freetime have significant relationship with high alcohol use. health and age show no significant relationship with high alcohol use. Intercept is around -5.2, so if explanatory variables are 0, high alcohol use should be around -5.2.
Odds ratio for the significant variables failure and freetime are 1.39 and 1.37 respectively. It means that both failures and freetime increase the risk of high alcohol consumption by the factor of around 1.4.The condidence intervals are shown in the table above.
m <- glm(high_use ~ failures + freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ failures + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4863 -0.8706 -0.7588 1.3198 1.9520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0671 0.4175 -4.951 7.37e-07 ***
## failures 0.3845 0.1467 2.621 0.00876 **
## freetime 0.3231 0.1198 2.696 0.00702 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 446.38 on 379 degrees of freedom
## AIC: 452.38
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 261 9
## TRUE 105 7
table(high_use = alc$high_use, prediction = alc$prediction)%>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.68324607 0.02356021 0.70680628
## TRUE 0.27486911 0.01832461 0.29319372
## Sum 0.95811518 0.04188482 1.00000000
plot=ggplot(alc, aes(x = probability, y = high_use, col = prediction))
plot+geom_point()
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2984293
The regression model gives better significance (p-values) than the previous model when the insignificant variables are excluded. Also, The model predicted FALSE when it actually was FALSE 68% of time and TRUE when it was TRUE 2% of time.The overall loss/wrong prediction of the model is 0.298 or around 30%
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2958115
The model error using 10 fold cross validation is 0.2958. The model doesn’t improve much after CV (0.298) and also couldn’t be made better to compared to the model in DataCamp (0.26). The reason is that the variables decide the outcome of the model and different chosen variables have different outcomes.
Predictors are famrel, freetime and goout
Model error = 0.249
m <- glm(high_use ~ famrel + freetime + goout, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + freetime + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6778 -0.7873 -0.5624 0.9771 2.4260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.2158 0.6582 -3.367 0.000761 ***
## famrel -0.4493 0.1355 -3.316 0.000913 ***
## freetime 0.1863 0.1354 1.376 0.168824
## goout 0.7518 0.1226 6.132 8.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 400.42 on 378 degrees of freedom
## AIC: 408.42
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2460733
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2513089
Didn’t plot the graph
m <- glm(high_use ~ famrel + freetime + goout + sex + failures + health + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + freetime + goout + sex + failures +
## health + studytime, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7443 -0.7532 -0.4947 0.8129 2.6350
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.78476 0.81326 -2.195 0.028194 *
## famrel -0.47922 0.14178 -3.380 0.000725 ***
## freetime 0.07604 0.14055 0.541 0.588466
## goout 0.77220 0.12675 6.092 1.11e-09 ***
## sexM 0.70164 0.27045 2.594 0.009478 **
## failures 0.14677 0.17195 0.854 0.393360
## health 0.11681 0.09437 1.238 0.215777
## studytime -0.43556 0.17341 -2.512 0.012015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 376.38 on 374 degrees of freedom
## AIC: 392.38
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2277487
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2408377
Significant variables = famrel, goout, sex, studytime
Model error = 0.238
Dropping 3 insignificant variables : freetime, failures, health
m <- glm(high_use ~ famrel + goout + sex + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5928 -0.7494 -0.4919 0.8250 2.7080
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2112 0.7142 -1.696 0.08991 .
## famrel -0.4496 0.1379 -3.260 0.00111 **
## goout 0.7947 0.1218 6.523 6.89e-11 ***
## sexM 0.7593 0.2655 2.860 0.00424 **
## studytime -0.4752 0.1703 -2.790 0.00527 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 379.27 on 377 degrees of freedom
## AIC: 389.27
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2408377
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2408377
Model error = 0.24
Dropping goout to see the effect as it contributes to the highest p-value of the model
m <- glm(high_use ~ famrel + sex + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + sex + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5550 -0.8415 -0.6547 1.2299 2.1761
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9178 0.5885 1.560 0.11884
## famrel -0.3257 0.1242 -2.623 0.00872 **
## sexM 0.7336 0.2471 2.968 0.00299 **
## studytime -0.4711 0.1563 -3.013 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 429.98 on 378 degrees of freedom
## AIC: 437.98
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2958115
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2958115
Model error = 0.29
library(dplyr) library(ggplot2) library(tidyr) library(corrplot) library(reshape2) library(plotly)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
This data tells about the housing values in sururbs of Boston.
It has 506 entries and 14 variables.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
cor_matrix<-cor(Boston)
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
From the correaltion matrix the findings are: 1. rad is highly positively correlated with tax. 2. dis is negatively correlated with index, nox and age. 3. lstat is negatively correlated with medv
crim = per capita crime rate by town zn = proportion of residential land zoned for lots over 25,000 sq.ft indus = proportion of non-retail business acres per town chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) nox = nitrogen oxides concentration (parts per 10 million) rm = average number of rooms per dwelling age = proportion of owner-occupied units built prior to 1940 dis = weighted mean of distances to five Boston employment centres rad = index of accessibility to radial highways tax = full-value property-tax rate per $10,000 ptratio = pupil-teacher ratio by town black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town lstat = lower status of the population (percent) medv = median value of owner-occupied homes in $1000s
hist(Boston$crim, col = "grey", main = "Distribution of crim")
hist(Boston$zn, col = "grey", main = "Distribution of zn")
hist(Boston$indus, col = "grey", main = "Distribution of indus")
hist(Boston$chas, col = "grey", main = "Distribution of chas")
hist(Boston$nox, col = "grey", main = "Distribution of nox")
hist(Boston$rm, col = "grey", main = "Distribution of rm")
hist(Boston$age, col = "grey", main = "Distribution of age")
hist(Boston$dis, col = "grey", main = "Distribution of dis")
hist(Boston$rad, col = "grey", main = "Distribution of rad")
hist(Boston$tax, col = "grey", main = "Distribution of tax")
hist(Boston$ptratio, col = "grey", main = "Distribution of ptratio")
hist(Boston$black, col = "grey", main = "Distribution of black")
hist(Boston$lstat, col = "grey", main = "Distribution of lstat")
hist(Boston$medv, col = "grey", main = "Distribution of medv")
The variables that follow normal distribution are rm and medv. It can be also seen from summary for these variables as mean and median are closer.
##### Scale the variables
boston_scaled <- scale(Boston)
##### Summaries of the scaled variables
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
##### Crime rate from scaled Boston dataset
#### change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
##### create a quantile vector of crim
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
##### Create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
##### Remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
##### Add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
##### Creatde test and training data
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
##### linear discriminant analysis
##### Using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2326733 0.2673267 0.2574257 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.8791264 -0.9045301 -0.10479289 -0.8483392 0.41277841
## med_low -0.1076981 -0.2957170 -0.01714665 -0.5726957 -0.15499608
## med_high -0.3874684 0.1994669 0.25766519 0.3864083 0.09429065
## high -0.4872402 1.0171960 -0.11163110 1.0945642 -0.34622987
## age dis rad tax ptratio
## low -0.8885287 0.8018394 -0.6862020 -0.7521099 -0.38878752
## med_low -0.3646583 0.3782031 -0.5511961 -0.5016816 -0.08210637
## med_high 0.4344042 -0.4117021 -0.3899690 -0.2829589 -0.26593090
## high 0.8062673 -0.8583516 1.6373367 1.5134896 0.77985517
## black lstat medv
## low 0.3871581 -0.78635419 0.492888966
## med_low 0.3544127 -0.13215282 0.006399878
## med_high 0.1343850 0.07132562 0.127431565
## high -0.9917623 0.95383950 -0.811052791
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.09633945 0.748274008 -0.97388903
## indus 0.02394773 -0.265236301 0.38655986
## chas -0.09469082 -0.069321904 0.05219572
## nox 0.47678460 -0.677350247 -1.36315105
## rm -0.03184565 -0.111710067 -0.18870498
## age 0.26107018 -0.356740518 -0.08614862
## dis -0.06765669 -0.256146080 0.32286415
## rad 2.97459552 0.958728294 0.17276309
## tax -0.05582287 -0.083235977 0.40493560
## ptratio 0.11511201 0.043167203 -0.33734675
## black -0.18962463 -0.002586388 0.16288448
## lstat 0.22232096 -0.264985128 0.42990797
## medv 0.14509551 -0.486993420 -0.08012524
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9435 0.0418 0.0147
##### biplot
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
##### save the correct classes from test data
correct_classes <- test$crime
##### remove the crime variable from test data
test <- dplyr::select(test, -crime)
classes <- as.numeric(test$crime)
##### predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
##### cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 19 11 3 0
## med_low 3 11 4 0
## med_high 1 9 12 0
## high 0 0 0 29
data("Boston")
boston_scaled <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4830 12.6100 13.5500 17.7600 48.8600
# k-means clustering
km <-kmeans(dist_eu, centers = 15)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# determine the number of clusters
set.seed(123)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
# k-means clustering
#### After calculating total within sum of squares and plotting it, sharpest drop is between 1 and 2, so 2 is probably the optimal cluster amount.
km <-kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)